Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
Authors
Abstract
This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in the sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent neural networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
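To make the two-loop structure concrete, the following is a minimal Python/NumPy sketch of a single Entropy-SGD update. It assumes a user-supplied stochastic gradient oracle grad_f(x, batch) for the training loss on a flat parameter vector; the hyperparameter values (number of Langevin steps L, scope gamma, step sizes eta and eta_prime, averaging factor alpha, thermal noise eps) are illustrative assumptions, not the settings used in the paper.

import numpy as np

def entropy_sgd_step(x, batches, grad_f, L=20, gamma=1e-4,
                     eta=0.1, eta_prime=0.1, alpha=0.75, eps=1e-4):
    """One outer update of Entropy-SGD on a flat parameter vector x.

    The inner loop runs stochastic gradient Langevin dynamics (SGLD) to
    estimate the mean mu of the local Gibbs distribution around x; the
    outer step then follows the local-entropy gradient gamma * (x - mu).
    Hyperparameter values are illustrative assumptions.
    """
    x_prime = x.copy()   # Langevin iterate, started at the current weights
    mu = x.copy()        # exponentially averaged estimate of the Gibbs mean
    for i in range(L):
        batch = batches[i % len(batches)]
        # gradient of the inner objective: data term plus coupling back to x
        dx = grad_f(x_prime, batch) - gamma * (x - x_prime)
        noise = eps * np.sqrt(eta_prime) * np.random.randn(*x.shape)
        x_prime = x_prime - eta_prime * dx + noise
        mu = (1.0 - alpha) * mu + alpha * x_prime
    # outer update along the negative local-entropy gradient
    return x - eta * gamma * (x - mu)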
Similar resources
On the Convergence of SGD Training of Neural Networks
Neural networks are usually trained by some form of stochastic gradient descent (SGD). A number of strategies are in common use intended to improve SGD optimization, such as learning rate schedules, momentum, and batching. These are motivated by ideas about the occurrence of local minima at different scales, valleys, and other phenomena in the objective function. Empirical results presented he...
"Oddball SGD": Novelty Driven Stochastic Gradient Descent for Training Deep Neural Networks
Stochastic Gradient Descent (SGD) is arguably the most popular of the machine learning methods applied to training deep neural networks (DNN) today. It has recently been demonstrated that SGD can be statistically biased so that certain elements of the training set are learned more rapidly than others. In this article, we place SGD into a feedback loop whereby the probability of selection is pro...
Modified Convolutional Neural Network Based on Dropout and the Stochastic Gradient Descent Optimizer
After analyzing the problems CNNs face in extracting convolution features, this study proposes a modified convolutional neural network (CNN) algorithm based on dropout and the stochastic gradient descent (SGD) optimizer (MCNN-DS), with the aim of improving the feature recognition rate and reducing the time cost of CNNs. The MCNN-DS has a quadratic CNN structure and adopts the rectified linear unit as ...
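As a rough illustration of the ingredients this abstract names (a ReLU CNN trained with dropout and the SGD optimizer), a minimal PyTorch sketch might look as follows; the layer sizes, dropout rate, and learning rate are assumptions for illustration, not the MCNN-DS configuration itself.

import torch
import torch.nn as nn

# Minimal sketch: a small ReLU CNN with dropout, trained by SGD.
# Architecture and hyperparameters are illustrative assumptions,
# not the MCNN-DS structure described in the abstract.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),                 # dropout before the classifier
    nn.Linear(64 * 7 * 7, 10),         # assumes 28x28 single-channel input
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One SGD step on a mini-batch."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()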
Rectified linear neural networks with tied-scalar regularization for LVCSR
It is known that rectified linear deep neural networks (RL-DNNs) can consistently outperform conventional pretrained sigmoid DNNs even with a random initialization. In this paper, we present another interesting and useful property of RL-DNNs: they can be learned with a very large batch size in stochastic gradient descent (SGD). Therefore, SGD learning can be easily parallelized amon...
Stochastic gradient optimization of importance sampling for the efficient simulation of digital communication systems
Importance sampling (IS) techniques offer the potential for large speed-up factors for bit error rate (BER) estimation using Monte Carlo (MC) simulation. To obtain these speed-up factors, the IS parameters specifying the simulation probability density function (pdf) must be carefully chosen. With the increased complexity in communication systems, analytical optimization of IS parameters can be ...
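To make the role of the IS parameters concrete, here is a minimal Python/NumPy sketch of an importance-sampling BER estimate for BPSK over an AWGN channel, where the mean shift theta of the biased noise pdf stands in for the IS parameter that must be chosen; the modulation, channel model, and parameter values are assumptions for illustration, not the system studied in the paper.

import numpy as np

def ber_importance_sampling(ebno_db=8.0, theta=1.0, n=200_000, seed=0):
    """Importance-sampling BER estimate for BPSK over AWGN.

    Noise is drawn from a mean-shifted Gaussian q = N(-theta, sigma^2)
    instead of the true pdf p = N(0, sigma^2), so bit errors occur far
    more often; each error sample is re-weighted by the likelihood
    ratio p/q. All values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(1.0 / (2.0 * 10.0 ** (ebno_db / 10.0)))  # unit-energy BPSK
    noise = rng.normal(loc=-theta, scale=sigma, size=n)       # biased sampling
    errors = (1.0 + noise) < 0.0        # transmitted +1 decided as -1
    # likelihood ratio p(noise)/q(noise) for the mean-shifted Gaussian
    w = np.exp((2.0 * theta * noise + theta**2) / (2.0 * sigma**2))
    return np.mean(errors * w)

The result can be sanity-checked against the exact BPSK error rate Q(1/sigma) computed from the Gaussian tail function.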
Journal: CoRR
Volume: abs/1611.01838
Pages: -
Publication date: 2016